Abstract: In this paper, we present an adaptive nonparametric
solution to the image parsing task, namely annotating each image 
pixel with its corresponding category label. For a given test 
image, first, a locality-aware retrieval set is extracted from the 
training data based on super-pixel matching similarities, which 
are augmented with feature extraction for better differentiation 
of local super-pixels. Then, the category of each super-pixel is
initialized by the majority vote of the k-nearest-neighbor
super-pixels in the retrieval set. Instead of fixing k as in traditional
non-parametric approaches, here we propose a novel adaptive 
nonparametric approach which determines the sample-specific 
k for each test image. In particular, k is adaptively set to be 
the number of the fewest nearest super-pixels which the images 
in the retrieval set can use to get the best category prediction. 
Finally, the initial super-pixel labels are further refined by
contextual smoothing. Extensive experiments on challenging datasets
demonstrate the superiority of the new solution over other
state-of-the-art nonparametric solutions.

Index Terms: image parsing, scene understanding, adaptive
nonparametric method.


I. Introduction 

Image parsing, also called scene understanding or scene
labeling, is a fundamental task in the computer vision literature
[1]–[12]. However,
image parsing is very challenging since it implicitly integrates
the tasks of object detection, segmentation, and multi-label 
recognition into one single process. Most current solutions to 
this problem follow the two-step pipeline. First, the category 
label of each pixel is initially assigned by using a certain 
classification algorithm. Then, contextual smoothing is applied 
to enforce the contextual constraints among the neighboring 
pixels. The algorithms in the classification step can be roughly 
divided into two categories, namely parametric methods and 
nonparametric methods. 

Parametric methods. Fulkerson et al. [13] constructed
an SVM classifier on the bag-of-words histogram of local
features around each super-pixel. Tighe et al. [14] combined
super-pixel level features with per-exemplar sliding window 
detectors to improve the performance. Socher et al. [15] 
proposed a method to aggregate super-pixels in a greedy 
fashion using a trained scoring function. The originality of 
this approach is that the feature vector of the combination
of two adjacent super-pixels is computed from the feature
vectors of the individual super-pixels through a trainable
function. Farabet et al. [16] later proposed to use a multiscale
convolutional network trained from raw pixels to extract dense 
feature vectors that encode regions of multiple sizes centered 
at each pixel. 

Nonparametric methods. Different from parametric methods,
nonparametric or data-driven methods rely on k-nearest
neighbor classifiers [4], [5]. Liu et al. [4] proposed
a nonparametric image parsing method based on estimating 
SIFT Flow, a dense deformation field between images. Given 
a test and a training image, the annotated category labels of 
the training pixels are transferred to the test ones via pixel
correspondences. However, inference via pixel-wise SIFT Flow is
complex and computationally expensive. Therefore,
Tighe et al. [5] further transferred labels at the level of
super-pixels, or coherent image regions produced by a bottom-up
segmentation method. In this scheme, given a test image,
the system searches for the top similar training images based 
on global features. The super-pixels of the most similar images 
are obtained as a retrieval set. Then the label of each
super-pixel in the test image is assigned based on the corresponding
k most similar super-pixels in the retrieval set. Eigen et al.
[17] further improved [5] by learning per-descriptor weights
that minimize classification error. In order to improve the
retrieval set, Singh et al. [18] used adaptive feature relevance
and semantic context. They adopted a locally adaptive distance 
metric which is learned at query time to compute the relevance 
of individual feature channels. Using the initial labelling as 
a contextual cue for presence or absence of objects in the 
scene, they proposed a semantic context descriptor which 
helped refine the quality of the retrieval set. In a different 
work, Yang et al. [19] looked into the long-tailed nature of
the label distribution. They expanded the retrieval set with rare-class
exemplars and thus achieved more balanced super-pixel
classification results. Meanwhile, Zhang et al. [20] proposed 
a method which exploits partial similarity between images. 
Namely, instead of retrieving globally similar images from the
training database, they retrieved some partially similar images 
so that for each region in the test image, a similar region exists 
in one of the retrieved training images. 

Due to the limited discriminating power of classification 
algorithms, the output initial labels of pixels may be noisy. To 
further enhance the label accuracy, contextual smoothing is 
generally used to exploit global contexts among the pixels. 
Rabinovich et al. [9] incorporated co-occurrence statistics
of category labels of super-pixels into the fully connected
Conditional Random Field (CRF). Galleguillos et al. [10]
proposed to exploit the information of relative location such 
as above, beside, or enclosed between super-pixel categories. 
Fig. 1. The flowchart of our proposed nonparametric image parsing. Given a test image, we segment the image into super-pixels. Then the locality-aware
retrieval set is extracted by using super-pixel matching, and the initial category label of each super-pixel is assigned by adaptive nonparametric super-pixel 
classification. The initial labels, in combination with contextual smoothing, give a dense labeling of the test image. The red rectangle highlights the new 
contributions of this work, and removing the keywords of locality-aware and adaptive in red then leads to the traditional nonparametric image parsing pipeline. 


Meanwhile, Myeong et al. [11] introduced a context link view
of contextual knowledge, where the relationship between a 
pair of annotated super-pixels is represented as a context link 
on a similarity graph of regions, and link analysis techniques 
are used to estimate the pairwise context scores of all pairs 
of unlabeled regions in the input image. Later, [12] proposed
a method to transfer high-order semantic relations of objects
from annotated images to unlabeled images. Zhu et al. [21]
proposed a hierarchical image model composed of rectangular
regions with parent-child dependencies. This model
captures large-distance dependencies and is solved efficiently 
using dynamic programming. However, it supports neither 
multiple hierarchies, nor dependencies between variables at 
the same level. In another work, Tu et al. [22] introduced a
unified framework to pool the information from segmentation,
detection and recognition for image parsing. However, designing
such complex models takes considerable effort, and because of
their complexity the models might not scale well across
different datasets.

In this work, our focus is placed on nonparametric solutions 
to the image parsing problem. However, there are several 
shortcomings in existing nonparametric methods. First, it is 
often quite difficult to get globally similar images to form 
the retrieval set. Also by only considering global features, 
some important local components or objects may be ignored. 
Second, k is fixed empirically in advance in such a nonparametric
image parsing scheme. Tighe et al. [5] reported the best
results by varying k on the test set. However, this strategy is 
impractical since the ground-truth labels are not provided in 
the testing phase. Therefore, the main issues in the context 
of the nonparametric image parsing are 1) how to get a good 
retrieval set, and 2) how to choose a good k for initial label 
transfer. In this work, we aim to improve both aspects, and 
the main contributions of this work are two-fold. 

1) Unlike the traditional retrieval set which consists of 
globally similar images, we propose the locality-aware 
retrieval set. The locality-aware retrieval set is extracted 
from the training data based on super-pixel matching 
similarities, which are augmented with feature extraction 
for better differentiation of local super-pixels. 

2) Instead of fixing k as in traditional nonparametric methods,
we propose an adaptive method to set the sample-specific
k as the number of the fewest nearest neighbors
which similar training super-pixels can use to get their 
best category label predictions. 


TABLE I

The list of all super-pixel features.

Type                        Dim     Type                        Dim
Centered mask                64     SIFT histogram top          100
Bounding box                  2     SIFT histogram right        100
Super-pixel area              1     SIFT histogram left         100
Absolute mask                64     Mean color                    3
Top height                    1     Color standard deviation      3
Texton histogram            100     Color histogram              33
Dilated texton histogram    100     Dilated color histogram      33
SIFT histogram              100     Color thumbnail             192
Dilated SIFT histogram      100     Masked color thumbnail      192
SIFT histogram bottom       100     GIST                        320

II. Adaptive Nonparametric Image Parsing 
A. Overview 

Generally, for nonparametric solutions to the image parsing 
task, the goal is to label the test image at the pixel level based 
on the content of the retrieval set, but assigning labels on a 
per-pixel basis as in [4], [16] would be too inefficient. In this
work, we choose to assign labels to super-pixels produced by
bottom-up segmentation as in [5]. This not only reduces the
complexity of the problem, but also gives better spatial support 
for aggregating features belonging to a single object than, say, 
fixed-size square patches centered at each pixel in an image. 

The training images are first over-segmented into super-pixels
by using the fast graph-based segmentation algorithm
of [23], and their appearances are described using 20 different
features similar to those of [5]. The complete list of super-pixel
features is summarized in Table I. Each training super-pixel
is assigned a category label if 50% or more of the super-pixel
overlaps with a ground truth segment mask of that label.
For each super-pixel, we perform feature extraction and then
reduce the dimension of the extracted feature.
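
The 50% overlap rule for labeling training super-pixels can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `label_superpixels`, the list-of-rows image layout, and the `UNLABELED` marker are assumptions introduced here.

```python
from collections import Counter, defaultdict

UNLABELED = None  # hypothetical marker for super-pixels without a majority label


def label_superpixels(segments, ground_truth, min_overlap=0.5):
    """Map each super-pixel id to the ground-truth label that covers at least
    `min_overlap` of its pixels, or UNLABELED if no label does.

    segments[y][x]     -- super-pixel id of pixel (y, x)
    ground_truth[y][x] -- annotated category label of pixel (y, x)
    """
    # Count, per super-pixel, how many of its pixels carry each label.
    votes = defaultdict(Counter)
    for seg_row, gt_row in zip(segments, ground_truth):
        for sp, label in zip(seg_row, gt_row):
            votes[sp][label] += 1

    labels = {}
    for sp, counter in votes.items():
        label, count = counter.most_common(1)[0]
        area = sum(counter.values())
        labels[sp] = label if count >= min_overlap * area else UNLABELED
    return labels
```

Super-pixels that straddle several objects without any label reaching the threshold stay unlabeled in this sketch, so they would not contribute votes during label transfer.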

For the test image, as illustrated in Figure 1, over-segmentation
and super-pixel feature extraction are also conducted.
Next, we perform the super-pixel matching process to
obtain the locality-aware retrieval set. The adaptive nonparametric
super-pixel classification is then used to determine the
initial label of each super-pixel. Finally, graphical model
inference is performed to preserve the semantic consistency
between adjacent pixels. More details of the proposed framework,
namely the locality-aware retrieval set, adaptive nonparametric
super-pixel classification, and contextual smoothing,
are elaborated as follows.
Fig. 2. The process to extract the retrieval set by super-pixel matching. The test image is first over-segmented into super-pixels. Then, we compute the
similarity between the test image and each training image as described in Algorithm 1. (Best viewed at 400% zoom.)


B. Locality-aware Retrieval Set 

For nonparametric image parsing, one important step of 
parsing a test image is to find a retrieval set of training images 
that will serve as the reference of candidate super-pixel level 
annotations. This is done not only for computational efficiency, 
but also to provide scene-level context for the subsequent 
processing steps. A good retrieval set should contain images 
of a similar scene type as that of the test image, along with 
similar objects and spatial layouts. Unlike [5], where global
features are used to obtain the retrieval set, we utilize
super-pixel matching as illustrated in Figure 2. The motivation
is that sometimes it may be difficult to get globally similar 
images, especially when the training set is not big enough, yet 
locally similar ones are easier to obtain; also sometimes if only 
global features are considered for retrieval set selection, some 
important local components or objects may be ignored. In this 
work, the retrieval set is selected based on local similarity 
measured over super-pixels. To enhance the discriminating
power of super-pixels, we utilize Linear Discriminant Analysis
(LDA) [24] to reduce the features to a lower dimension.
Then we use the augmented super-pixel similarity instead of
global similarity to extract the retrieval set.

Denote $x \in \mathbb{R}^{n_x \times 1}$ as the original feature vector of the
super-pixel, where $n_x$ is the dimension of the feature vector.
The corresponding feature vector $\hat{x}$ after the feature reduction
is computed as

$$\hat{x} = W x, \tag{1}$$

where W is the transformation matrix. In particular, LDA
looks for the directions that are most effective for discrimination
by minimizing the ratio between the intra-category ($S_w$)
and inter-category ($S_b$) scatters:

$$W^* = \arg\min_{W} \frac{|W^T S_w W|}{|W^T S_b W|}, \tag{2}$$

$$S_w = \sum_{i=1}^{N} (x_i - \bar{x}_{c_i})(x_i - \bar{x}_{c_i})^T, \tag{3}$$

$$S_b = \sum_{c=1}^{N_c} n_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T, \tag{4}$$

where N is the number of super-pixels in all training images,
$N_c$ is the number of categories, $n_c$ is the number of super-pixels
for the c-th category, $x_i$, $\forall i \in \{1, \cdots, N\}$, is the feature
vector of one training super-pixel, $c_i$ is the category label of
the i-th super-pixel in the training images, $\bar{x}$ is the mean
feature vector of the training super-pixels, and $\bar{x}_c$ is the mean of
the c-th category. Note that the category label of each super-pixel
is obtained from the ground-truth object segment with
the largest overlap with the super-pixel. As shown in [24],
the projection matrix $W^*$ is composed of the eigenvectors of
$S_w^{-1} S_b$. Note that there are at most $N_c - 1$ eigenvectors with
non-zero real corresponding eigenvalues since there are only
$N_c$ points to compute $S_b$. In other words, the dimensionality
of W is $(N_c - 1) \times n_x$. Therefore, LDA naturally reduces the
feature dimension to $N_c - 1$ in the image parsing task. Since
the number of categories is much smaller than the number of features,
the benefits of the reduced dimension include a smaller
memory footprint and the removal of less informative
features for the subsequent super-pixel matching. The
reduction of feature dimension is also beneficial to the nearest
super-pixel search in the super-pixel classification stage.
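
As a minimal worked case of the LDA reduction, assume only two categories and two-dimensional features. The inter-category scatter then has rank one, and the single discriminative direction is proportional to the inverse intra-category scatter applied to the difference of the two category means, so every feature vector is reduced to $N_c - 1 = 1$ number. The helper names below are hypothetical, and this closed-form sketch is for illustration only; with many categories one would instead use an eigen-solver on the two scatter matrices.

```python
def mean_2d(rows):
    """Mean of a list of 2-D points."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]


def fisher_direction_2d(class_a, class_b):
    """Unit vector w proportional to S_w^{-1} (mean_a - mean_b).

    Two-class, 2-D special case only (an assumption of this sketch).
    """
    mu_a, mu_b = mean_2d(class_a), mean_2d(class_b)

    # Intra-category scatter S_w: sum of outer products of centered samples.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, mu in ((class_a, mu_a), (class_b, mu_b)):
        for x in rows:
            d = (x[0] - mu[0], x[1] - mu[1])
            for r in range(2):
                for c in range(2):
                    s[r][c] += d[r] * d[c]

    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if det == 0.0:
        raise ValueError("S_w is singular; more samples or regularization needed")

    # Closed-form inverse of the 2x2 matrix S_w.
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = (mu_a[0] - mu_b[0], mu_a[1] - mu_b[1])
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    return [w[0] / norm, w[1] / norm]


def project(w, x):
    """Reduced one-dimensional feature: the projection of x onto w."""
    return w[0] * x[0] + w[1] * x[1]
```

Projecting the samples of both categories with `project` gives one scalar per super-pixel, and nearest-neighbor search then runs in that reduced space.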

The procedure to obtain the retrieval set is summarized in
Algorithm 1. Denote $n_q$ as the number of super-pixels in
the test image, $n^t_j$ as the number of super-pixels for
the j-th training image, and $N_I$ as the number of training
images. We impose the natural constraint that one super-pixel
in a training image is matched with only one super-pixel of
the test image. We denote S as the unique index set which
stores the indices of the already matched super-pixels, v as
the similarity vector between the test image and all training
images, $Q \in \mathbb{R}^{(N_c - 1) \times n_q}$ as the feature matrix for all the
super-pixels in the test image, $T \in \mathbb{R}^{(N_c - 1) \times \sum_j n^t_j}$ as the
feature matrix for all the super-pixels in the training set, and
$m \in \mathbb{R}^{\sum_j n^t_j}$ as the mapping index between the super-pixel
and the corresponding training image. As aforementioned, the
over-segmentation of the image is performed by using [23].
Then we extract the corresponding features as in [5] for
each super-pixel and use LDA to reduce the feature dimension.

We match each super-pixel in the test image with all super-pixels
in the training set. In order to reduce the complexity, we
perform k-NN to find the nearest $k_m$ super-pixels in the training
images for the i-th super-pixel in the test image. The Euclidean
distance is used to calculate the dissimilarity between two
super-pixels. As a result, we have $\eta_i \in \mathbb{R}^{k_m}$ as the indices of
the returned nearest super-pixels of the i-th test super-pixel,
and $\Delta_i \in \mathbb{R}^{k_m}$ as the corresponding distances of the returned
nearest super-pixels to the i-th test super-pixel. We remove
the super-pixels in S from $\eta_i$, where S includes the training
super-pixels matched by the first $i - 1$ test super-pixels. There
may be more than one super-pixel from one training image,
thus REFINEINDEXSET is performed to keep the nearest one.
Note that $|\cdot|$ denotes the number of elements in an array.
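Then, the index set S is updated by adding $\eta_i$.
The function FINDIMAGEINDEX is invoked to retrieve
the corresponding image index of $\eta_i$. Then we update the
similarity vector v. The number of super-pixels is not the
same for every image; for example, the number of super-pixels per image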
Fig. 3. The distribution of best ks for all training images in the SIFTFlow dataset. It can be observed that there is no dominant k from 1 to 50.


Algorithm 1 Locality-aware Retrieval Set Algorithm

parameters: $n_q$, $n^t$, $N_I$, $Q$, $T$.
The unique index set $S = \emptyset$.
$v = 0 \in \mathbb{R}^{N_I}$.
for $i = 1 : n_q$ do
    $[\eta_i, \Delta_i] \leftarrow$ KNN($Q_i$, $T$, $k_m$);
    $\eta_i \leftarrow \eta_i \setminus S$;
    if $\eta_i \neq \emptyset$ then
        $\eta_i \leftarrow$ REFINEINDEXSET($\eta_i$, $\Delta_i$);
        $I_i \leftarrow$ FINDIMAGEINDEX($\eta_i$);
        $v(I_i) \leftarrow v(I_i) + 1 ./ \Delta_i(\eta_i)$;
        $S \leftarrow S \cup \eta_i$;
    end if
end for
$v$ = NORMALIZEANDSORT($v$).
$k_r = \arg\min_{k} \left\{ k : \sum_{j=1}^{k} v_j / \sum_{j=1}^{N_I} v_j > r \right\}$.
return top $k_r$ training images.

function REFINEINDEXSET($\eta$, $\Delta$)
    $d = \infty \in \mathbb{R}^{N_I}$.
    $\Gamma = \emptyset$.
    for $i = 1 : |\eta|$ do
        if $d(m(\eta_i)) > \Delta_i$ then
            $d(m(\eta_i)) = \Delta_i$;
        else
            $\Gamma = \Gamma \cup \eta_i$;
        end if
    end for
    return $\eta \setminus \Gamma$.
end function

function FINDIMAGEINDEX($\eta$)
    $\Gamma = \infty \in \mathbb{R}^{|\eta|}$.
    for $i = 1 : |\eta|$ do
        $\Gamma_i = m(\eta_i)$;
    end for
    return $\Gamma$.
end function


in the SIFTFlow training set varies from 5 to 193. Therefore we
perform NORMALIZEANDSORT to obtain the final similarity
vector. Namely, for each training image j, $v_j$ is divided by
$\min(n_q, n^t_j)$. The retrieval set then includes the top $k_r$ training
images, where $k_r$ is the smallest number satisfying

$$\frac{\sum_{j=1}^{k_r} v_j}{\sum_{j=1}^{N_I} v_j} > r,$$

and the parameters $k_m$ and r
are selected by grid search over the training set based
on the leave-one-out strategy. Namely, we choose a pair of
$r \in \{0.1, \ldots, 0.5\}$ with step size 0.1 and $k_m \in \{500, \ldots, 2500\}$
with step size 500, and perform the following adaptive nonparametric
super-pixel classification for all images in the
training set. The leave-one-out strategy means that when one
training image is selected as a test image, the remaining training
images are used as the corresponding training set.
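
The retrieval-set procedure can be sketched in a few lines of Python. This is an illustrative reading of the steps described above, not the authors' code: the function name `retrieval_set`, its argument layout, the brute-force neighbor search, and the small epsilon guarding against zero distances are all assumptions made for this sketch.

```python
import math


def retrieval_set(query_feats, train_feats, image_of, n_train_sp, k_m, r):
    """Return indices of the training images forming the retrieval set.

    query_feats -- reduced feature vectors of the test super-pixels
    train_feats -- reduced feature vectors of all training super-pixels
    image_of    -- image_of[s] is the training image owning super-pixel s
    n_train_sp  -- n_train_sp[j] is the number of super-pixels of image j
    k_m, r      -- neighbor count and cumulative-similarity threshold
    """
    n_images = len(n_train_sp)
    n_q = len(query_feats)
    v = [0.0] * n_images   # similarity of the test image to each training image
    matched = set()        # training super-pixels that are already matched

    for q in query_feats:
        # Brute-force k_m nearest training super-pixels, nearest first.
        nearest = sorted((math.dist(q, t), s)
                         for s, t in enumerate(train_feats))[:k_m]
        seen_images = set()
        for dist, s in nearest:
            j = image_of[s]
            # Skip already matched super-pixels; keep only the nearest
            # remaining super-pixel of each training image.
            if s in matched or j in seen_images:
                continue
            seen_images.add(j)
            v[j] += 1.0 / max(dist, 1e-12)
            matched.add(s)

    # Normalize by the smaller super-pixel count, then rank the images.
    ranked = sorted(((v[j] / min(n_train_sp[j], n_q), j)
                     for j in range(n_images)), reverse=True)
    total = sum(score for score, _ in ranked)
    if total == 0.0:
        return []

    # Keep the fewest top-ranked images whose cumulative share exceeds r.
    chosen, cumulative = [], 0.0
    for score, j in ranked:
        chosen.append(j)
        cumulative += score
        if cumulative / total > r:
            break
    return chosen
```

For realistic training-set sizes the brute-force search would be replaced by an approximate nearest-neighbor index; the matching and ranking logic would stay the same.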


C. Adaptive Nonparametric Super-pixel Classification 

Adaptive nonparametric super-pixel classification aims to 
overcome the limitation of the traditional k-nearest neighbor
(k-NN) algorithm which usually assigns the same number
of nearest neighbors for each test sample. For nonparametric
algorithms, the label of each super-pixel in the test image is
assigned based on the corresponding similar k super-pixels in
the retrieval set. Our improved k-NN algorithm focuses on
looking for the suitable k for each test sample.

Basically the sample-specific k of each test image is propagated
from its similar training images. In particular, each
training image t retrieved by the super-pixel matching process
is considered as one test image, while the remaining $N_I - 1$ images in
the training set are referred to as the corresponding training set.
Then we perform super-pixel matching to obtain the retrieval
set for t and assign the label $l_i^*$ of the i-th super-pixel by the
majority vote of the k nearest super-pixels in the retrieval set,


Algorithm 1 (continued)

function NORMALIZEANDSORT($v$)
    $\Gamma = \infty \in \mathbb{R}^{N_I}$.
    for $i = 1 : |v|$ do
        $\Gamma_i = v_i / \min(n^t_i, n_q)$;
    end for
    $\Gamma$ = sort($\Gamma$).
    return $\Gamma$.
end function


$$l_i^* = \arg\max_{l_i} L(k, l_i), \tag{5}$$

where L is the likelihood ratio for the i-th super-pixel to have
the category $l_i$ based on the k nearest super-pixels, defined
as

$$L(k, l_i) = \frac{P(i \mid l_i, k)}{P(i \mid \bar{l}_i, k)} = \frac{n(l_i, NN(i,k)) / n(l_i, D)}{n(\bar{l}_i, NN(i,k)) / n(\bar{l}_i, D)}. \tag{6}$$

Here $n(l_i, NN(i,k))$ is the number of super-pixels with class
label $l_i$ in the k nearest super-pixels of the i-th super-pixel
in the retrieval set, $\bar{l}_i$ is the set of all labels excluding $l_i$,
and D is the set of all super-pixels in the whole training set.
$NN(i, k)$ consists of the k nearest super-pixels of the i-th super-pixel
from the retrieval set. Then we compute the per-pixel
accuracy of each retrieved training image t for different ks.
We denote $A_t^k$ as the per-pixel performance (the percentage
of all ground-truth pixels that are correctly labeled) of the
training image t with the parameter value k. We vary k from
1 to 50 with step size 1, $k \in \{1, 2, 3, \cdots, 50\}$. As can be
observed in Figure 3, there is no dominant k from 1 to 50 in
the overall SIFTFlow training set. This motivates the use of
adaptive k nearest neighbors for the nonparametric super-pixel
classification process. Thus, for each test image, we assign its
k by transferring the ks of the similar images returned by the
super-pixel matching process,

$$k^* = \arg\max_{k} \sum_{t=1}^{k_r} A_t^k, \tag{7}$$

where $k_r$ is the number of images in the retrieval set for the
test image. Then based on the selected $k^*$, the initial label of a
super-pixel in the test image is obtained in the same way as
in Eqn. (5).
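
The likelihood-ratio vote and the transfer of k from the retrieved training images can be sketched as below. The function names, the dictionary of training label counts, and the small `eps` smoothing term (added here only to avoid division by zero) are assumptions of this sketch and do not come from the paper. Ties between equally good values of k are resolved toward the smallest k, in line with the stated preference for the fewest nearest super-pixels.

```python
def likelihood_ratio(label, neighbour_labels, train_counts, eps=1e-6):
    """Likelihood ratio of `label` given the labels of the k nearest
    super-pixels and the label counts over the whole training set."""
    k = len(neighbour_labels)
    n_pos = sum(1 for l in neighbour_labels if l == label)  # neighbors with label
    n_neg = k - n_pos                                       # neighbors without it
    d_pos = train_counts[label]                             # training set, with label
    d_neg = sum(train_counts.values()) - d_pos              # training set, without it
    return ((n_pos + eps) / (d_pos + eps)) / ((n_neg + eps) / (d_neg + eps))


def classify(sorted_neighbour_labels, k, train_counts):
    """Label with the highest likelihood ratio among the k nearest
    super-pixels (labels are given in order of increasing distance)."""
    top = sorted_neighbour_labels[:k]
    return max(sorted(train_counts),
               key=lambda l: likelihood_ratio(l, top, train_counts))


def choose_k(accuracy):
    """accuracy[t][k - 1] is the per-pixel accuracy of retrieved training
    image t when parsed with k neighbors; returns the k with the highest
    summed accuracy, preferring the smallest k on ties."""
    n_k = len(accuracy[0])
    totals = [sum(row[i] for row in accuracy) for i in range(n_k)]
    return totals.index(max(totals)) + 1
```

Because the ratio divides by the training frequency of each label, a single rare-class neighbor can outweigh several common-class neighbors, which is one reason the best k differs between images.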

D. Contextual Smoothing 

Generally, the initial labels for the super-pixels may still be
noisy, and these labels need to be further refined with global context
information. The contextual constraints are very important
for parsing images. For example, a pixel assigned with “car”
is likely connected with “road”. Therefore, the initial labels
are smoothed with an MRF energy function defined over
the field of pixels:

$$E(l) = \sum_{i \in V} E_d(i, l_i) + \lambda \sum_{e_{ij} \in E} E_s(l_i, l_j), \tag{8}$$

where V is the set of all pixels in the image, E is the set
of edges connecting adjacent pixels, and $\lambda$ is a smoothing
constant. The data term is defined as follows

$$E_d(i, l_i) = -\log L(k^*, l_{sp(i)}), \tag{9}$$

where sp(i) means the super-pixel containing the i-th pixel
and the L function is defined in Eqn. (6). The MRF
model also includes the smoothness constraint reflecting the
spatial consistency (pixels or super-pixels close to each other
are most likely to have similar labels). Therefore, the smoothing
term $E_s(l_i, l_j)$ imposes a penalty when two adjacent pixels
($p_i$, $p_j$) are similar but are assigned different labels ($l_i$,
$l_j$). $E_s$ is defined based on probabilities of label co-occurrence
and biases the neighboring pixels to have the same label in the
case that no other information is available, and the probability
depends on the edge of the image:

$$E_s(l_i, l_j) = -\xi_{ij} \times \log\left(\frac{P(l_i \mid l_j) + P(l_j \mid l_i)}{2}\right) \times \delta[l_i \neq l_j], \tag{10}$$

where $P(l_i \mid l_j)$ is the conditional probability of one pixel
having label $l_i$ given that its neighbor has label $l_j$, estimated
by counts from the training set. $\xi_{ij}$ is defined based on the
normalized gradient value of the neighboring pixels:

$$\xi_{ij} = \frac{\nabla_{ij}}{\sum_{e_{pq} \in E} \nabla_{pq}}, \tag{11}$$

where $\nabla_{ij} = \|I(i) - I(j)\|_2$ is the $\ell_2$ norm of the gradient
of the test image I at a pixel i and its neighbor pixel j.
The stronger the luminance edge is, the more likely the
neighboring pixels may have different labels. Multiplication
with the constant Potts penalty $\delta[l_i \neq l_j]$ is necessary to ensure
that this energy term is semi-metric as required by graph cut
inference [25]. We perform the inference using the $\alpha$-$\beta$ swap
algorithm [25], [26], [27].
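
To make the structure of the energy concrete, the sketch below evaluates it for a given labeling on a small 4-connected grid: a sum of per-pixel data costs plus a weighted pairwise term that is only charged where neighboring labels differ, scaled by the normalized intensity difference across the edge. It only scores a labeling and does not perform the swap-based inference. The function name, the grayscale intensity input, the precomputed `data_cost` table, and the `cond_prob` matrix are assumptions of this illustration.

```python
import math


def smoothing_energy(labels, intensity, data_cost, cond_prob, lam):
    """Energy of a labeling on a 4-connected pixel grid.

    labels[y][x]       -- label index assigned to pixel (y, x)
    intensity[y][x]    -- grayscale value (assumption: scalar intensities)
    data_cost[y][x][l] -- data term of pixel (y, x) for label l
    cond_prob[a][b]    -- P(label a | neighbor has label b), assumed > 0
    lam                -- smoothing constant
    """
    h, w = len(labels), len(labels[0])

    # Enumerate horizontal and vertical neighbor pairs once.
    edges = []
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                edges.append(((y, x), (y, x + 1)))
            if y + 1 < h:
                edges.append(((y, x), (y + 1, x)))

    # Gradient magnitude per edge, normalized over all edges.
    grad = {(p, q): abs(intensity[p[0]][p[1]] - intensity[q[0]][q[1]])
            for p, q in edges}
    total_grad = sum(grad.values()) or 1.0

    energy = sum(data_cost[y][x][labels[y][x]]
                 for y in range(h) for x in range(w))
    for p, q in edges:
        lp, lq = labels[p[0]][p[1]], labels[q[0]][q[1]]
        if lp == lq:
            continue  # Potts factor: no penalty for equal labels
        weight = grad[(p, q)] / total_grad
        co_occurrence = (cond_prob[lp][lq] + cond_prob[lq][lp]) / 2.0
        energy += lam * (-weight * math.log(co_occurrence))
    return energy
```

An inference routine such as a swap-move graph cut would search for the labeling that minimizes this value; the function above is only useful for checking or comparing candidate labelings.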

III. Experiments 

A. Datasets and Evaluation Metrics 

In this section, our approach is validated on two challenging 
datasets: SIFTFlow [4] and 19-Category LabelMe [28].

The SIFTFlow dataset¹ is composed of 2,688 images that have
been thoroughly labeled by LabelMe users. The image size
is 256 × 256 pixels. Liu et al. [4] split this dataset into
2,488 training images and 200 test images and used synonym 
correction to obtain 33 semantic labels (sky, building, tree, 
mountain, road, sea, field, car, sand, river, plant, grass, window, 
sidewalk, rock, bridge, door, fence, person, staircase, awning, 
sign, boat, crosswalk, pole, bus, balcony, streetlight, sun, bird, 
cow, desert, and moon).

The 19-Category LabelMe dataset² was built by Jain et al. [28], who randomly
collected 350 images from LabelMe [8] with 19 categories
(grass, tree, field, building, rock, water, road, sky, person, car, 
sign, mountain, ground, sand, bison, snow, boat, airplane, and 
sidewalk). This dataset is split into 250 training images and 
100 test images. 

We evaluate our approach on both sets, but perform additional
analysis on the SIFTFlow dataset since it has a larger
number of categories and images. In evaluating image parsing
algorithms, there are two metrics that are commonly used:
per-pixel and per-category classification rate. The former is the
total proportion of correctly labeled pixels, while the latter
is the average proportion of correctly labeled pixels in
each object category. If the category distribution is uniform,
then the two would be the same, but this is not the case
for real-world scenes. Note that for all experiments, $\lambda$
is empirically set to 16 in the contextual smoothing process.
$k_m$ and r are set to 1000 and 0.3, respectively. In all of
our experiments, we use the Euclidean distance metric to find the
nearest neighbors.
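
The two evaluation metrics can be computed as in the following sketch. The function name `parsing_rates` and the flat per-pixel label lists are assumptions made for illustration; categories that do not occur in the ground truth are left out of the per-category average.

```python
from collections import defaultdict


def parsing_rates(predicted, ground_truth):
    """Return (per_pixel_rate, per_category_rate) for two equally long
    sequences of per-pixel labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, g in zip(predicted, ground_truth):
        total[g] += 1
        if p == g:
            correct[g] += 1

    # Per-pixel: share of all pixels labeled correctly.
    per_pixel = sum(correct.values()) / sum(total.values())
    # Per-category: accuracy within each category, averaged over categories.
    per_category = sum(correct[c] / total[c] for c in total) / len(total)
    return per_pixel, per_category
```

With eight "sky" pixels and two "boat" pixels, predicting "sky" everywhere gives a per-pixel rate of 0.8 but a per-category rate of only 0.5, which shows why the two metrics diverge when the label distribution is long-tailed.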

B. Performance on the SIFTFlow Dataset 

Comparison of our algorithm with state-of-the-art methods.
Table II reports per-pixel and average per-category rates for
image parsing on the SIFTFlow dataset. Even though the
nonparametric methods are our main baselines, we still list
parametric methods for reference. Our proposed method outperforms
the nonparametric baselines by at least 1.8% in per-pixel rate
and 1.5% in per-category rate. We did not
compare our work with [18] and [19] since [18] uses a
different set of super-pixel features whereas [19] utilizes
extra data to balance the distribution of the categories
in the retrieval set. Compared with our initial super-pixel

1 http://people.csail.mit.edu/celiu/LabelTransfer/LabelTransfer.rar 

2 http://www.umiacs.umd.edu/~ajain/dataset/LabelMesubsetdataset.zip 
Fig. 4. (Top) Label frequencies for the pixels in the SIFTFlow training set. (Bottom) The per-category classification rates of different ks and our adaptive
nonparametric method on the SIFTFlow dataset. The categories ‘bird’, ‘cow’, ‘desert’, and ‘moon’ are dropped from the figure since they are not present in
the test split. (Best viewed at 400% zoom.)


TABLE II

Performance comparison of our algorithm with other algorithms on the SIFTFlow dataset [4]. Per-pixel rates and average per-category rates are presented. The best value in each column is marked with an asterisk (*).

Algorithm                                            Per-Pixel (%)   Per-Category (%)
Parametric Baselines
  Tighe et al. [14]                                      78.6            39.2*
  Farabet et al. [16]                                    78.5            29.6
Nonparametric Baselines
  Liu et al. [4]                                         74.8            -
  Tighe et al. [5]                                       76.3            28.8
  Tighe et al. [5] (adding geometric information)        76.9            29.4
  Myeong et al. [11]                                     76.2            29.6
  Eigen et al. [17]                                      77.1            32.5
Our Proposed Adaptive Nonparametric Algorithm
  Super-pixel Classification                             77.2            34.9
  Contextual Smoothing                                   78.9*           34.0

TABLE III

Performance comparison of different ks and our algorithm on the SIFTFlow dataset [4]. Per-pixel rates and average per-category rates are presented.

Parameter                      Per-Pixel (%)   Per-Category (%)
k = 1                              70.2            31.9
k = 5                              76.6            34.8
k = 10                             77.5            34.6
k = 20                             77.8            33.5
k = 30                             77.9            33.3
k = 40                             77.9            30.6
k = 50                             77.8            29.5
k = 60                             77.5            28.6
k = 70                             77.8            28.5
k = 80                             77.5            28.2
k = 90                             (illegible)     27.2
k = 100                            76.9            26.8
Adaptive k in Our Algorithm        78.9            34.0


classification result, the final contextual smoothing improves
the overall per-pixel rate on the SIFTFlow dataset by about 1.7%.
The average per-category rate drops slightly due to the contextual
smoothing on some of the smaller classes. Note that Tighe et
al. [14] improved [5] by adding a large number of object detectors
(their per-pixel rate reaches 78.6%). The addition of many object
detectors brings a better per-category performance but also
increases the processing time since running object detection is
very time-consuming. Note that, to train the object detectors,
[14] must use extra data. Also, [14] utilizes SVM instead of k-NN
as in our work, which may bring better classification results,
especially for some rare categories. Meanwhile, our proposed
method improves [5] with a simpler solution and even achieves
a better performance in terms of per-pixel rate (78.9%). Also,
our method performs better than [16], which relies heavily on
deep learning features.

Performance of different ks. The impact of different
ks is further investigated on the SIFTFlow dataset. In this
experiment, the parameter k varies from 1 to 100. LDA
and super-pixel matching are utilized in order to keep a fair
comparison with our adaptive nonparametric method. Table III
summarizes the performance of different ks on both per-pixel
and per-category criteria. The relationship between per-pixel
[Figure 5 panels: test image; top-ranked retrieved images from the locality-aware retrieval set, the global matching retrieval set [5], and the GIST-based matching retrieval set [4].]

Fig. 5. Top 4 exemplar retrieval results of super-pixel matching, global matching [5], and GIST-based matching [4]. (a) Global matching returns “tall building”
and “open country” scenes, and GIST-based matching obtains “inside city” and “mountain”. Meanwhile, our method obtains the reasonable images of “urban
street”. (b) The “open country” images are retrieved in GIST-based matching and the “sunset coastal” scenes are returned in global matching instead of
“highway” as in our method.


TABLE IV

Performance comparison of different settings on the SIFTFlow dataset [4]. Per-pixel classification rates (with per-category rates in parentheses) are presented.

Algorithm                                                       Performance
Baseline
  SuperParsing [5]                                              76.3 (28.8)
Our Improvements
  SuperParsing + LDA + Global Matching + (fixed k = 20)         76.4 (31.2)
  SuperParsing + LDA + Super-pixel Matching + (k = 20)          77.8 (33.5)
  SuperParsing + LDA + Super-pixel Matching + Adaptive k        78.9 (34.0)


and per-category rates of different ks is inconsistent. Smaller
ks (< 20) tend to achieve a higher per-category rate whereas
larger ks lean toward a higher per-pixel rate. A lower k
responds well to rare categories (e.g., boat, pole, and bus, as
illustrated in Figure 4), thus it leads to improved per-category
classification. Meanwhile, a higher k leads to better per-pixel
accuracy since it works well for more common categories such
as sky, building, and tree. k = 5 yields the largest per-category
rate among the fixed ks, but its per-pixel rate is 1.3% lower than that of
k = 40. For a closer look, Figure 4 also shows the details of
per-category classification rates of different ks. Smaller ks
yield better results on the categories with a small number of
samples while larger ks favor categories with
a large number of samples such as sky and sea. As observed
in the same Figure 4, our adaptive nonparametric approach

TABLE V

The evaluation of the relevance of a retrieval set with respect to a query.

Retrieval Set Algorithm        NDCG
GIST-based matching [4]        0.83
Global matching [5]            0.85
Super-pixel matching           0.88


exhibits advantages over smaller and larger ks. 
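
The tension between the two rates discussed above follows from how they are computed. The sketch below is only illustrative: it assumes the usual definitions (per-pixel rate as the fraction of correctly labeled pixels, per-category rate as the mean of the per-category accuracies), and the function names and toy labels are hypothetical rather than taken from the experiments.

```python
from collections import defaultdict

def per_pixel_rate(truth, pred):
    """Fraction of all pixels whose predicted label matches the ground truth."""
    correct = sum(t == p for t, p in zip(truth, pred))
    return correct / len(truth)

def per_category_rate(truth, pred):
    """Mean of the per-category accuracies; every category counts equally."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for t, p in zip(truth, pred):
        total[t] += 1
        hit[t] += (t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)

# Toy image (hypothetical): 8 "sky" pixels (common) and 2 "boat" pixels (rare).
truth = ["sky"] * 8 + ["boat"] * 2
all_sky = ["sky"] * 10                      # favors the common category
half_boat = ["sky"] * 5 + ["boat"] * 5      # recovers boat, loses some sky

print(per_pixel_rate(truth, all_sky), per_category_rate(truth, all_sky))
print(per_pixel_rate(truth, half_boat), per_category_rate(truth, half_boat))
```

In this toy case the prediction that ignores the rare category has the higher per-pixel rate, while the prediction that recovers it has the higher per-category rate, which mirrors the behavior of large and small ks.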

How each new component affects SuperParsing [5]. In 
order to study the impact of each newly proposed component, 
another experiment is conducted with different configuration 
settings. Namely, we report the results of incrementally adding 
LDA, super-pixel matching, and adaptive nonparametric super-pixel 
classification to the traditional nonparametric image 
parsing pipeline [5]. Keeping k fixed at 
20 and the number of similar images in the retrieval set 
at 200, as recommended in [5], and adding LDA increases 
the per-pixel rate of [5] by a small margin (0.1%). We observe a 
larger gain of 1.4% in per-pixel rate from adding super-pixel 
matching. Further adding adaptive nonparametric super-pixel 
classification improves the combination ([5], LDA, 
super-pixel matching, and fixed k = 20) by 1.1% in per-pixel 
rate. Overall, our work improves [5] by 2.6% 
in terms of per-pixel rate and 5.2% in terms of per-category 
rate. The results clearly show the effectiveness of our proposed 
super-pixel matching and adaptive nonparametric super-pixel 
classification. Figure 6 shows the exemplar results of different 
experimental settings on the SIFTFlow dataset. 
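
The gains quoted above can be re-derived from the rates in Table IV. The snippet below is a small arithmetic check, not part of the method; the setting names are abbreviated and the variable names are hypothetical.

```python
# (setting, per-pixel %, per-category %) copied from Table IV
settings = [
    ("SuperParsing [5]", 76.3, 28.8),
    ("+ LDA + global matching, fixed k = 20", 76.4, 31.2),
    ("+ LDA + super-pixel matching, fixed k = 20", 77.8, 33.5),
    ("+ LDA + super-pixel matching + adaptive k", 78.9, 34.0),
]

def gains(rows):
    """Gain of each row over the previous one: (name, per-pixel, per-category)."""
    out = []
    for prev, cur in zip(rows, rows[1:]):
        out.append((cur[0],
                    round(cur[1] - prev[1], 1),
                    round(cur[2] - prev[2], 1)))
    return out

step_gains = gains(settings)
total_pixel = round(settings[-1][1] - settings[0][1], 1)      # overall per-pixel gain
total_category = round(settings[-1][2] - settings[0][2], 1)   # overall per-category gain

for name, d_pixel, d_category in step_gains:
    print(f"{name}: +{d_pixel} per-pixel, +{d_category} per-category")
print(f"total: +{total_pixel} per-pixel, +{total_category} per-category")
```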

How good is the locality-aware retrieval set? We evaluate 
the performance of our retrieval set via Normalized Discounted 
Cumulative Gain (NDCG) [29], which is commonly used to 
Fig. 6. Exemplar results from the SIFTFlow dataset. In (a), the adaptive nonparametric method successfully parses the test image. In (b), the “rock” is 
classified correctly, whereas the other two methods label it as “river” or “mountain”. In (c), our method recovers the “sun” and removes the spurious classification of the sun’s 
reflection in the water as “sky”. In (d), the regions labeled “sea” by the two other methods are recovered as “road”. In (e), some of the trees are recovered by the 
adaptive nonparametric method. In (f), our method recovers “window” from “door”. Best viewed in color. 
Fig. 7. Exemplar results on the 19-Category LabelMe dataset [28]. The test images, ground truth, and results from our proposed adaptive nonparametric 
method are shown in triplets. Best viewed in color. 


evaluate ranking systems. NDCG is defined as follows: 

$$\mathrm{NDCG}@k_r = \frac{1}{Z} \sum_{i=1}^{k_r} \frac{2^{rel(i)} - 1}{\log(i + 1)}, \qquad (12)$$ 


where rel(·) is a binary value indicating whether the scene 
of the returned image is relevant (with value 1) or irrelevant 
(with value 0) to that of the query image, and Z is a 
constant that normalizes the calculated score. Recall that k_r is 
the number of images returned in the locality-aware retrieval 
set; it is used here to ensure a fair comparison. As shown in Table V, 
our super-pixel matching outperforms the other baselines, namely 
GIST-based matching and global matching, in terms of NDCG. 
Figure 5 also demonstrates the good results of our locality-aware 
retrieval set. 
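
As an illustration of the NDCG measure, the sketch below scores a ranked list of binary relevance values. The text only states that Z is a normalizing constant, so the choice made here (Z is the DCG of a list in which all k_r returned images are relevant) is an assumption, and the function names are hypothetical. Since the same logarithm appears in the score and in Z, its base does not change the result.

```python
import math

def dcg(rels):
    """Discounted cumulative gain of a ranked list of relevance values."""
    return sum((2 ** rel - 1) / math.log(i + 1)
               for i, rel in enumerate(rels, start=1))

def ndcg_at_k(rels, k_r):
    """NDCG@k_r for binary relevance values rel(1), ..., rel(k_r), with k_r >= 1.

    Assumption: Z is the DCG of a perfect list in which all k_r
    returned images are relevant, so that scores lie in [0, 1].
    """
    z = dcg([1] * k_r)
    return dcg(rels[:k_r]) / z

# Toy rankings (hypothetical): 1 = same scene as the query, 0 = different scene.
print(ndcg_at_k([1, 1, 0, 1], 4))
print(ndcg_at_k([0, 1, 1, 1], 4))
```

Both toy rankings contain three relevant images, but the first places a relevant image at rank 1 and therefore scores higher, which is the position sensitivity that makes NDCG suitable for comparing retrieval sets.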


Adaptive k on different scene classes. Based on our 
hypothesis that similar images should share the same k, 
we would like to study how the adaptive k selection works 
for different types (scene classes) of similar images. To this 
end, we divide the images in the SIFTFlow dataset into scene 
classes based on their filenames. For example, the test image 
“coast_arnat59.jpg” is assigned to the coast scene class. In total, 
there are 8 scene classes, namely, coast, forest, highway, inside 
city, mountain, open country, street, and tall building. We 
compute the mean number of categories (car, building, road, 
etc.) per image for each scene class in the testing set of the 
SIFTFlow dataset. Next, we 
compute the selected k for each scene class by selecting the 
k that has the highest confidence over all of the images in the 
same scene. The mean number of categories and the selected 
k of each scene class are reported in Table VI. As we can 
observe, the scene images with more object categories, i.e., 
highway, inside city, and street, have lower ks. In contrast, the 
scene images with fewer object categories have larger ks. Note 
that our method is unaware of the scene class of the test image. 
This means that our method adapts well to different scene classes, 
which contributes to the improvement in image parsing. In 
preliminary experiments, we randomized the order of the 
test super-pixels, but the performance was similar 
to that obtained with the order from 1 to n_q. Therefore, the order of the 
super-pixels of the test image does not affect the performance. 
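
The per-scene analysis above can be sketched as follows. This is only one possible reading: the text does not specify how confidences are combined across the images of a scene, so summing them per k is an assumption, and the function names, the input layout, and all toy values (other than the filename quoted in the text) are hypothetical.

```python
from collections import defaultdict

def scene_class(filename):
    """Scene class from the filename prefix, e.g. 'coast_arnat59.jpg' -> 'coast'."""
    return filename.split("_", 1)[0]

def summarize_by_scene(images):
    """images: list of (filename, categories_in_image, {k: confidence}).

    Returns {scene: (mean number of categories, selected k)}.
    Assumption: the selected k of a scene is the k whose confidence,
    summed over the images of that scene, is highest.
    """
    n_cats = defaultdict(list)
    conf = defaultdict(lambda: defaultdict(float))
    for name, cats, k_conf in images:
        scene = scene_class(name)
        n_cats[scene].append(len(set(cats)))
        for k, c in k_conf.items():
            conf[scene][k] += c
    return {
        s: (sum(n_cats[s]) / len(n_cats[s]), max(conf[s], key=conf[s].get))
        for s in n_cats
    }

# Toy input; the categories and confidences are made up for illustration.
images = [
    ("coast_arnat59.jpg", ["sea", "sky", "sand"], {5: 0.6, 20: 0.7}),
    ("coast_example1.jpg", ["sea", "sky"], {5: 0.5, 20: 0.9}),
    ("street_example2.jpg", ["road", "car", "building", "person"], {5: 0.8, 20: 0.6}),
]
summary = summarize_by_scene(images)
print(summary)
```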


TABLE VI 

The mean number of categories and the correspondingly selected k of each scene class on the SIFTFlow dataset. 

Scene Class      Mean No. of Categories    Selected k 
Coast            3.8                       12 
Forest           2.5                       36 
Highway          6.5                       6 
Inside City      7.2                       12 
Mountain         2.6                       22 
Open Country     3.9                       14 
Street           7.5                       6 
Tall Building    3.3                       43 
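
As a quick sanity check of the trend visible in Table VI (more object categories, smaller selected k), the snippet below computes the Pearson correlation between the two columns. This check is an illustrative addition rather than part of the reported experiments, and the helper name is hypothetical.

```python
import math

# (scene class, mean no. of categories, selected k) copied from Table VI
table_vi = [
    ("coast", 3.8, 12), ("forest", 2.5, 36), ("highway", 6.5, 6),
    ("inside city", 7.2, 12), ("mountain", 2.6, 22),
    ("open country", 3.9, 14), ("street", 7.5, 6), ("tall building", 3.3, 43),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson([row[1] for row in table_vi], [row[2] for row in table_vi])
print(f"correlation between mean no. of categories and selected k: {r:.2f}")
```

A negative coefficient is consistent with the observation that scenes containing more object categories are assigned smaller ks.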


TABLE VII 

Performance comparison of our algorithm with other algorithms on the 19-Category LabelMe dataset [28]. 
Per-pixel rates and average per-category rates are presented. 
The best performance values are marked with an asterisk (*). 

Algorithm                            Per-Pixel (%)    Per-Category (%) 
Parametric Baselines 
  Jain et al. [28]                   59.0             - 
  Chen et al. [30]                   75.6             45.0 
Nonparametric Baselines 
  Myeong et al. [6]                  80.1             53.3 
Adaptive Nonparametric Algorithm 
  Super-pixel Classification         80.3             53.3 
  Contextual Smoothing               82.7*            55.1* 


C. Performance on 19-Category LabelMe Dataset 

Table VII shows the performance of our work compared 
with other baselines on the 19-Category LabelMe dataset. Our 
final adaptive nonparametric method achieves a per-pixel rate of 
82.7% on this dataset, surpassing all state-of-the-art results. Among 
the nonparametric methods, our result surpasses 
that of Myeong et al. [6] by a large margin (2.6%). Compared with the 
parametric method [30], our work improves by 7.1%. Some 
exemplar results on this dataset are shown in Figure 7. 
IV. Conclusions and Future Work 

This paper has presented a novel approach to image parsing 
that takes advantage of adaptive nonparametric super-pixel 
classification. To the best of our knowledge, we are the first 
to exploit the locality-aware retrieval set and adaptive 
nonparametric super-pixel classification in image parsing. 
Extensive experimental results demonstrate that the 
proposed method achieves state-of-the-art performance 
on diverse and challenging image parsing datasets. 

For future work, we are interested in exploring possible 
extensions to improve the performance. For example, the 
combination weights of different types of features can be 
learned. Another possible extension is to transfer 
other parameters apart from k, for example, the λ of the 
contextual smoothing process, from the retrieved training images to 
the test image. Since the current solution is specific to image 
parsing, we are also interested in generalizing the proposed 
method to other recognition tasks, such as image retrieval, 
and to general k-NN classification applications. We also plan to 
extend our work to the video domain, e.g., action recognition [31] 
and human fixation prediction [32]. 

Last but not least, to speed up the super-pixel matching 
process, we can embed Locality-Sensitive Hashing (LSH) [33] 
or the recently introduced Set Compression Tree (SCT) [34] 
to encode the feature representations in a few bits (instead of 
bytes) for large-scale matching. These coding methods and 
the small number of super-pixels in each image make 
our super-pixel matching process feasible. In this paper, we 
only investigate the impact of the adaptive nonparametric method 
on scene parsing. The use of LSH or SCT, which are 
suitable for large-scale datasets, will be considered for building 
a practical system in the future. 
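
To make the idea of encoding features in a few bits concrete, the sketch below uses random-hyperplane hashing, one common LSH family for cosine similarity. It is an illustrative assumption rather than the scheme of the cited LSH work, and the function names, dimensions, and feature values are hypothetical.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_code(vec, planes):
    """Pack one bit per hyperplane: 1 if vec lies on its non-negative side."""
    code = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (dot >= 0)
    return code

def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")

planes = make_hyperplanes(dim=8, n_bits=16)
feature = [0.3, -1.2, 0.7, 0.0, 2.1, -0.4, 0.9, 1.5]   # toy super-pixel feature
scaled = [2.0 * x for x in feature]                    # same direction
opposite = [-x for x in feature]                       # opposite direction

print(hamming(lsh_code(feature, planes), lsh_code(scaled, planes)))
print(hamming(lsh_code(feature, planes), lsh_code(opposite, planes)))
```

Each feature is reduced to a 16-bit integer, and candidate matches can then be compared by Hamming distance instead of a full distance computation on the original descriptors.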